A dataset containing all the information I needed did not exist, so I had to create one by merging various datasets sourced from different governmental websites, such as the National Institutes of Health (NIH), GeorgiaData.org, Data.org, and Georgia.gov.
First, I will download and save the dataset regarding the last presidential elections results that will be used at the end to assess whether I have achieved my initial goal. The other datasets I will import will be managed and combined to create the main dataset that will be used for this assignment.
presidential_2024 <-
read.csv('data_boso//2024_US_County_Level_Presidential_Results.csv')
georgia_election <- presidential_2024 |>
filter(state_name == "Georgia") |>
rename(State = state_name) |>
mutate(State = ifelse(State == "Georgia", "GA", State)) |>
rename(pct_gop = per_gop, pct_dem = per_dem, pct_diff = per_point_diff, diff_gop_dem = diff) |>
mutate(
pct_gop = pct_gop * 100,
pct_dem = pct_dem * 100,
other_votes = total_votes - (votes_gop + votes_dem),
pct_other = (other_votes / total_votes) * 100) |>
select(county_fips, county_name, total_votes, pct_gop, pct_dem, pct_other)
head(georgia_election)
## county_fips county_name total_votes pct_gop pct_dem pct_other
## 1 13001 Appling County 8334 81.12551 18.71850 0.1559875
## 2 13003 Atkinson County 3057 76.87275 22.89827 0.2289827
## 3 13005 Bacon County 4839 86.50548 13.32920 0.1653234
## 4 13007 Baker County 1476 59.82385 39.97290 0.2032520
## 5 13009 Baldwin County 18825 50.85790 48.65339 0.4887118
## 6 13011 Banks County 10519 88.96283 10.79951 0.2376652
Data about health index
file_path <- "data_boso/2024 County Health Rankings Georgia Data - v2.xlsx"
data_health <- read_excel(file_path, sheet = 'Select Measure Data', skip = 1)
## New names:
## • `Unreliable` -> `Unreliable...4`
## • `95% CI - Low` -> `95% CI - Low...7`
## • `95% CI - High` -> `95% CI - High...8`
## • `National Z-Score` -> `National Z-Score...9`
## • `95% CI - Low` -> `95% CI - Low...39`
## • `95% CI - High` -> `95% CI - High...40`
## • `National Z-Score` -> `National Z-Score...41`
## • `95% CI - Low` -> `95% CI - Low...43`
## • `95% CI - High` -> `95% CI - High...44`
## • `National Z-Score` -> `National Z-Score...45`
## • `95% CI - Low` -> `95% CI - Low...47`
## • `95% CI - High` -> `95% CI - High...48`
## • `National Z-Score` -> `National Z-Score...49`
## • `Unreliable` -> `Unreliable...50`
## • `95% CI - Low` -> `95% CI - Low...52`
## • `95% CI - High` -> `95% CI - High...53`
## • `National Z-Score` -> `National Z-Score...54`
## • `95% CI - Low` -> `95% CI - Low...77`
## • `95% CI - High` -> `95% CI - High...78`
## • `National Z-Score` -> `National Z-Score...79`
## • `95% CI - Low` -> `95% CI - Low...81`
## • `95% CI - High` -> `95% CI - High...82`
## • `National Z-Score` -> `National Z-Score...83`
## • `National Z-Score` -> `National Z-Score...85`
## • `95% CI - Low` -> `95% CI - Low...87`
## • `95% CI - High` -> `95% CI - High...88`
## • `National Z-Score` -> `National Z-Score...89`
## • `National Z-Score` -> `National Z-Score...91`
## • `95% CI - Low` -> `95% CI - Low...93`
## • `95% CI - High` -> `95% CI - High...94`
## • `National Z-Score` -> `National Z-Score...95`
## • `95% CI - Low` -> `95% CI - Low...99`
## • `95% CI - High` -> `95% CI - High...100`
## • `National Z-Score` -> `National Z-Score...101`
## • `National Z-Score` -> `National Z-Score...104`
## • `95% CI - Low` -> `95% CI - Low...106`
## • `95% CI - High` -> `95% CI - High...107`
## • `National Z-Score` -> `National Z-Score...108`
## • `95% CI - Low` -> `95% CI - Low...132`
## • `95% CI - High` -> `95% CI - High...133`
## • `National Z-Score` -> `National Z-Score...134`
## • `National Z-Score` -> `National Z-Score...138`
## • `National Z-Score` -> `National Z-Score...142`
## • `National Z-Score` -> `National Z-Score...146`
## • `National Z-Score` -> `National Z-Score...148`
## • `National Z-Score` -> `National Z-Score...155`
## • `National Z-Score` -> `National Z-Score...162`
## • `Population` -> `Population...169`
## • `95% CI - Low` -> `95% CI - Low...171`
## • `95% CI - High` -> `95% CI - High...172`
## • `National Z-Score` -> `National Z-Score...173`
## • `Population` -> `Population...175`
## • `95% CI - Low` -> `95% CI - Low...177`
## • `95% CI - High` -> `95% CI - High...178`
## • `National Z-Score` -> `National Z-Score...179`
## • `National Z-Score` -> `National Z-Score...183`
## • `95% CI - Low` -> `95% CI - Low...185`
## • `95% CI - High` -> `95% CI - High...186`
## • `National Z-Score` -> `National Z-Score...187`
## • `National Z-Score` -> `National Z-Score...196`
## • `95% CI - Low` -> `95% CI - Low...200`
## • `95% CI - High` -> `95% CI - High...201`
## • `National Z-Score` -> `National Z-Score...202`
## • `National Z-Score` -> `National Z-Score...205`
## • `95% CI - Low` -> `95% CI - Low...208`
## • `95% CI - High` -> `95% CI - High...209`
## • `National Z-Score` -> `National Z-Score...210`
## • `National Z-Score` -> `National Z-Score...233`
## • `National Z-Score` -> `National Z-Score...235`
## • `95% CI - Low` -> `95% CI - Low...237`
## • `95% CI - High` -> `95% CI - High...238`
## • `National Z-Score` -> `National Z-Score...248`
## • `95% CI - Low` -> `95% CI - Low...250`
## • `95% CI - High` -> `95% CI - High...251`
## • `National Z-Score` -> `National Z-Score...252`
## • `95% CI - Low` -> `95% CI - Low...270`
## • `95% CI - High` -> `95% CI - High...271`
## • `National Z-Score` -> `National Z-Score...272`
georgia_health <- data_health |>
filter(FIPS != 13000) |>
select(County,
`Food Environment Index`,
`% Uninsured`,
`Average Daily PM2.5`,
`% Severe Housing Problems`,
`% Fair or Poor Health`) |>
mutate(County = paste0(County, " County")) |>
rename(county_name = County)
head(georgia_health)
## # A tibble: 6 × 6
## county_name `Food Environment Index` `% Uninsured` `Average Daily PM2.5`
## <chr> <dbl> <dbl> <dbl>
## 1 Appling County 5.4 20.0 9.1
## 2 Atkinson County 6.2 26.3 8.9
## 3 Bacon County 7.5 18.8 8.9
## 4 Baker County 4.9 17.5 9.4
## 5 Baldwin County 6.9 16.4 9.9
## 6 Banks County 8.2 21.3 9.4
## # ℹ 2 more variables: `% Severe Housing Problems` <dbl>,
## # `% Fair or Poor Health` <dbl>
Data about the population traits
data_pop <- data_health <- read_excel(file_path, sheet = 'Additional Measure Data', skip = 1)
## New names:
## • `95% CI - Low` -> `95% CI - Low...5`
## • `95% CI - High` -> `95% CI - High...6`
## • `# Deaths` -> `# Deaths...28`
## • `95% CI - Low` -> `95% CI - Low...30`
## • `95% CI - High` -> `95% CI - High...31`
## • `# Deaths` -> `# Deaths...53`
## • `95% CI - Low` -> `95% CI - Low...55`
## • `95% CI - High` -> `95% CI - High...56`
## • `# Deaths` -> `# Deaths...78`
## • `95% CI - Low` -> `95% CI - Low...80`
## • `95% CI - High` -> `95% CI - High...81`
## • `95% CI - Low` -> `95% CI - Low...104`
## • `95% CI - High` -> `95% CI - High...105`
## • `95% CI - Low` -> `95% CI - Low...107`
## • `95% CI - High` -> `95% CI - High...108`
## • `95% CI - Low` -> `95% CI - Low...110`
## • `95% CI - High` -> `95% CI - High...111`
## • `95% CI - Low` -> `95% CI - Low...120`
## • `95% CI - High` -> `95% CI - High...121`
## • `95% CI - Low` -> `95% CI - Low...144`
## • `95% CI - High` -> `95% CI - High...145`
## • `95% CI - Low` -> `95% CI - Low...148`
## • `95% CI - High` -> `95% CI - High...149`
## • `95% CI - Low` -> `95% CI - Low...152`
## • `95% CI - High` -> `95% CI - High...153`
## • `95% CI - Low` -> `95% CI - Low...159`
## • `95% CI - High` -> `95% CI - High...160`
## • `Average Grade Performance` -> `Average Grade Performance...161`
## • `Average Grade Performance (AIAN)` -> `Average Grade Performance
## (AIAN)...162`
## • `Average Grade Performance (Asian)` -> `Average Grade Performance
## (Asian)...163`
## • `Average Grade Performance (Black)` -> `Average Grade Performance
## (Black)...164`
## • `Average Grade Performance (Hispanic)` -> `Average Grade Performance
## (Hispanic)...165`
## • `Average Grade Performance (White)` -> `Average Grade Performance
## (White)...166`
## • `Average Grade Performance` -> `Average Grade Performance...167`
## • `Average Grade Performance (AIAN)` -> `Average Grade Performance
## (AIAN)...168`
## • `Average Grade Performance (Asian)` -> `Average Grade Performance
## (Asian)...169`
## • `Average Grade Performance (Black)` -> `Average Grade Performance
## (Black)...170`
## • `Average Grade Performance (Hispanic)` -> `Average Grade Performance
## (Hispanic)...171`
## • `Average Grade Performance (White)` -> `Average Grade Performance
## (White)...172`
## • `Segregation Index` -> `Segregation Index...173`
## • `95% CI - Low` -> `95% CI - Low...179`
## • `95% CI - High` -> `95% CI - High...180`
## • `95% CI - Low` -> `95% CI - Low...182`
## • `95% CI - High` -> `95% CI - High...183`
## • `Segregation Index` -> `Segregation Index...200`
## • `95% CI - Low` -> `95% CI - Low...205`
## • `95% CI - High` -> `95% CI - High...206`
## • `# Deaths` -> `# Deaths...228`
## • `95% CI - Low` -> `95% CI - Low...230`
## • `95% CI - High` -> `95% CI - High...231`
## • `95% CI - Low` -> `95% CI - Low...256`
## • `95% CI - High` -> `95% CI - High...257`
## • `95% CI - Low` -> `95% CI - Low...281`
## • `95% CI - High` -> `95% CI - High...282`
## • `95% CI - Low` -> `95% CI - Low...313`
## • `95% CI - High` -> `95% CI - High...314`
## • `95% CI - Low` -> `95% CI - Low...317`
## • `95% CI - High` -> `95% CI - High...318`
## • `95% CI - Low` -> `95% CI - Low...321`
## • `95% CI - High` -> `95% CI - High...322`
## • `95% CI - Low` -> `95% CI - Low...340`
## • `95% CI - High` -> `95% CI - High...341`
georgia_pop <- data_pop |>
filter(FIPS != 13000) |>
select(County,
`Life Expectancy`,
`Population`,
`% 65 and Over`,
`% Black`,
`% Non-Hispanic White`) |>
mutate(County = paste0(County, " County")) |>
rename(county_name = County)
head(georgia_pop)
## # A tibble: 6 × 6
## county_name `Life Expectancy` Population `% 65 and Over` `% Black`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Appling County 72.2 18428 18.6 18.6
## 2 Atkinson County 70.2 8183 14.4 14.8
## 3 Bacon County 70.3 11191 17.3 16.5
## 4 Baker County 74.5 2788 26.3 40.0
## 5 Baldwin County 73.1 43635 17.5 41.5
## 6 Banks County 74.0 19328 18.0 3.12
## # ℹ 1 more variable: `% Non-Hispanic White` <dbl>
Data about the Education
education_usa <-
read.csv('data_boso//Education2023.csv')
education_attributes <- c(
"Percent of adults who are not high school graduates, 2019-23",
"Percent of adults who are high school graduates (or equivalent), 2019-23",
"Percent of adults completing some college or associate degree, 2019-23",
"Percent of adults with a bachelor's degree or higher, 2019-23"
)
georgia_education <- education_usa |>
filter(State == "GA", Attribute %in% education_attributes) |>
filter(Area.name != 'Georgia') |>
mutate(Attribute = case_when(
Attribute == "Percent of adults who are not high school graduates, 2019-23" ~ "pct_low_edu",
Attribute == "Percent of adults who are high school graduates (or equivalent), 2019-23" ~ "pct_middle_low_edu",
Attribute == "Percent of adults completing some college or associate degree, 2019-23" ~ "pct_middle_high_edu",
Attribute == "Percent of adults with a bachelor's degree or higher, 2019-23" ~ "pct_high_edu",
TRUE ~ Attribute
)) |>
pivot_wider(
names_from = Attribute,
values_from = Value
) |>
rename(county_name = Area.name, county_fips = FIPS.Code) |>
select( -State)
head(georgia_education)
## # A tibble: 6 × 6
## county_fips county_name pct_low_edu pct_middle_low_edu pct_middle_high_edu
## <int> <chr> <dbl> <dbl> <dbl>
## 1 13001 Appling County 19.0 41.1 30.0
## 2 13003 Atkinson County 28.4 33.4 28.4
## 3 13005 Bacon County 15.4 41.7 31.9
## 4 13007 Baker County 9.46 41.7 32.0
## 5 13009 Baldwin County 14.8 33.7 27.4
## 6 13011 Banks County 17.7 34.3 30.6
## # ℹ 1 more variable: pct_high_edu <dbl>
Data about Income and Unemployment
labour_usa <- read.csv('data_boso//Unemployment2023.csv')
labour_attributes <- c(
"Unemployment_rate_2023",
"Median_Household_Income_2022"
)
georgia_labour <- labour_usa |>
filter(State == 'GA', Attribute %in% labour_attributes) |>
filter(Area_Name != 'Georgia') |>
pivot_wider(
names_from = Attribute,
values_from = Value
) |>
rename( county_name = Area_Name, Unemployment_rate = Unemployment_rate_2023, Median_Household_Income = Median_Household_Income_2022) |>
mutate(county_name = sub(", GA$", "", county_name)) |>
select(county_name, Unemployment_rate, Median_Household_Income)
head(georgia_labour)
## # A tibble: 6 × 3
## county_name Unemployment_rate Median_Household_Income
## <chr> <dbl> <dbl>
## 1 Appling County 3.4 51217
## 2 Atkinson County 3.1 46060
## 3 Bacon County 2.9 46044
## 4 Baker County 3.8 50129
## 5 Baldwin County 4.2 49883
## 6 Banks County 2.2 66928
Data about to identify the type of counties (rural-urban)
georgia_regions <- read.csv('data_boso/regions_georgiaUSA.csv')
Data about Poverty
poverty_usa <- read.csv('data_boso//Poverty2023.csv')
poverty_attributes <- c(
"PCTPOVALL_2023",
"PCTPOV017_2023"
)
georgia_poverty <- poverty_usa |>
filter(Stabr == 'GA', Attribute %in% poverty_attributes) |>
filter(Area_Name != "Georgia") |>
pivot_wider(names_from = Attribute,
values_from = Value
) |>
rename( county_name = Area_Name, pct_people_poverty = PCTPOVALL_2023, pct_children_poverty = PCTPOV017_2023) |>
select(-FIPS_Code, - Stabr)
head(georgia_poverty)
## # A tibble: 6 × 3
## county_name pct_people_poverty pct_children_poverty
## <chr> <dbl> <dbl>
## 1 Appling County 21.1 33.5
## 2 Atkinson County 22.4 34.3
## 3 Bacon County 22.9 31.1
## 4 Baker County 26.2 39.1
## 5 Baldwin County 21.4 29.2
## 6 Banks County 12.2 16.6
Unify and adjust the new dataset
georgia_data <- georgia_pop |>
full_join(georgia_regions, by = "county_name") |>
full_join(georgia_health, by = "county_name") |>
full_join(georgia_poverty, by = "county_name") |>
full_join(georgia_labour, by = "county_name") |>
full_join(georgia_education,by = "county_name")
desired_order <- c(
"county_fips", "county_name", "region", "Population", "Life Expectancy",
"% 65 and Over", "% Black", "% Non-Hispanic White", "Food Environment Index",
"% Uninsured", "Average Daily PM2.5", "% Severe Housing Problems",
"% Fair or Poor Health", "pct_people_poverty", "pct_children_poverty",
"Unemployment_rate", "Median_Household_Income", "pct_low_edu",
"pct_middle_low_edu", "pct_middle_high_edu", "pct_high_edu"
)
georgia_data <- georgia_data[, desired_order]
head(georgia_data)
## # A tibble: 6 × 21
## county_fips county_name region Population `Life Expectancy` `% 65 and Over`
## <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 13001 Appling County " Rur… 18428 72.2 18.6
## 2 13003 Atkinson Coun… " Rur… 8183 70.2 14.4
## 3 13005 Bacon County " Rur… 11191 70.3 17.3
## 4 13007 Baker County " Rur… 2788 74.5 26.3
## 5 13009 Baldwin County " Urb… 43635 73.1 17.5
## 6 13011 Banks County " Rur… 19328 74.0 18.0
## # ℹ 15 more variables: `% Black` <dbl>, `% Non-Hispanic White` <dbl>,
## # `Food Environment Index` <dbl>, `% Uninsured` <dbl>,
## # `Average Daily PM2.5` <dbl>, `% Severe Housing Problems` <dbl>,
## # `% Fair or Poor Health` <dbl>, pct_people_poverty <dbl>,
## # pct_children_poverty <dbl>, Unemployment_rate <dbl>,
## # Median_Household_Income <dbl>, pct_low_edu <dbl>, pct_middle_low_edu <dbl>,
## # pct_middle_high_edu <dbl>, pct_high_edu <dbl>
Now that we have our dataset we can look at each and every variable that we have:
county_fips: code to identify the county
county_name: the name of the specific county
region: to distinguish if the county in question is rural or urban.
Population: total population of the county according to the 2020 U.S. Census.
Life Expectancy: the life expectancy in that county
% 65 and Over: percentage of population ages 65 and older.
% Black: percentage of population identifying as non-Hispanic Black or African American.
% Non-Hispanic White: percentage of population identifying as non-Hispanic white.
Food Environment Index: index of factors that contribute to a healthy food environment, from 0 (worst) to 10 (best).
% Uninsured: percentage of population under age 65 without health insurance.
Average Daily PM2.5: average daily amount of fine particulate matter in micrograms per cubic meter
% Severe Housing Problems: percentage of households with at least 1 of 4 housing problems: overcrowding, high housing costs, or lack of kitchen or plumbing facilities
% Fair or Poor Health: percentage of adults that report fair or poor health
pct_people_poverty: percentage of the total population living below the poverty line.
pct_children_poverty: percentage of children living below the poverty line in the county.
Unemployment_rate: percentage of the total labor force that is unemployed.
Median_Household_Income: median household income within the county.
pct_low_edu: percentage of the population with the lowest level of education (less than a high school diploma).
pct_middle_low_edu: percentage of the population with medium-low education (high school diploma).
pct_middle_high_edu: percentage of the population with medium-high education (some college or an associate degree).
pct_high_edu: percentage of the population with the highest level of education (bachelor’s degree or higher).
summary(georgia_data)
## county_fips county_name region Population
## Min. :13001 Length:159 Length:159 Min. : 1600
## 1st Qu.:13082 Class :character Class :character 1st Qu.: 11090
## Median :13161 Mode :character Mode :character Median : 24064
## Mean :13161 Mean : 68634
## 3rd Qu.:13242 3rd Qu.: 57984
## Max. :13321 Max. :1074634
##
## Life Expectancy % 65 and Over % Black % Non-Hispanic White
## Min. :67.73 Min. : 5.443 Min. : 0.7529 Min. : 8.002
## 1st Qu.:72.12 1st Qu.:15.816 1st Qu.:14.9212 1st Qu.:50.784
## Median :73.54 Median :17.950 Median :27.5515 Median :61.166
## Mean :73.99 Mean :18.614 Mean :27.8147 Mean :60.800
## 3rd Qu.:75.69 3rd Qu.:20.346 3rd Qu.:39.1131 3rd Qu.:71.317
## Max. :81.30 Max. :36.864 Max. :71.1971 Max. :93.879
## NA's :1
## Food Environment Index % Uninsured Average Daily PM2.5
## Min. :3.30 Min. : 9.35 Min. : 7.200
## 1st Qu.:6.90 1st Qu.:15.22 1st Qu.: 9.100
## Median :7.50 Median :16.80 Median : 9.400
## Mean :7.35 Mean :16.74 Mean : 9.336
## 3rd Qu.:8.00 3rd Qu.:18.28 3rd Qu.: 9.800
## Max. :9.30 Max. :26.28 Max. :10.800
## NA's :4
## % Severe Housing Problems % Fair or Poor Health pct_people_poverty
## Min. : 7.774 Min. :10.60 Min. : 5.10
## 1st Qu.:11.589 1st Qu.:17.80 1st Qu.:13.20
## Median :13.966 Median :20.40 Median :17.80
## Mean :14.109 Mean :20.49 Mean :18.17
## 3rd Qu.:16.130 3rd Qu.:23.50 3rd Qu.:22.65
## Max. :23.874 Max. :30.70 Max. :36.30
##
## pct_children_poverty Unemployment_rate Median_Household_Income
## Min. : 5.30 Min. :2.100 Min. : 35952
## 1st Qu.:18.25 1st Qu.:2.900 1st Qu.: 46715
## Median :26.40 Median :3.300 Median : 52695
## Mean :25.71 Mean :3.489 Mean : 58390
## 3rd Qu.:32.00 3rd Qu.:3.800 3rd Qu.: 65052
## Max. :45.10 Max. :7.800 Max. :130909
##
## pct_low_edu pct_middle_low_edu pct_middle_high_edu pct_high_edu
## Min. : 2.869 Min. :14.51 Min. :17.64 Min. : 7.817
## 1st Qu.:10.584 1st Qu.:31.39 1st Qu.:26.91 1st Qu.:13.654
## Median :14.786 Median :35.68 Median :29.36 Median :18.149
## Mean :14.741 Mean :35.29 Mean :29.12 Mean :20.849
## 3rd Qu.:17.978 3rd Qu.:40.44 3rd Qu.:31.48 3rd Qu.:24.902
## Max. :28.386 Max. :49.21 Max. :43.36 Max. :57.995
##
We have 4 NAs in the variable ‘Food Environment Index’ and one in ‘Life Expectancy’, we will try to manually estimate the NAs basically looking at similar countries that share the same characteristic (IMPUTE WITH THE MEAN) -> compute the variable of similar counties, doing the mean to predict the missing value
m = 4 # number of multiple imputations
mice_mod <- mice(georgia_data, m=m, method='rf')
##
## iter imp variable
## 1 1 Life Expectancy Food Environment Index
## 1 2 Life Expectancy Food Environment Index
## 1 3 Life Expectancy Food Environment Index
## 1 4 Life Expectancy Food Environment Index
## 2 1 Life Expectancy Food Environment Index
## 2 2 Life Expectancy Food Environment Index
## 2 3 Life Expectancy Food Environment Index
## 2 4 Life Expectancy Food Environment Index
## 3 1 Life Expectancy Food Environment Index
## 3 2 Life Expectancy Food Environment Index
## 3 3 Life Expectancy Food Environment Index
## 3 4 Life Expectancy Food Environment Index
## 4 1 Life Expectancy Food Environment Index
## 4 2 Life Expectancy Food Environment Index
## 4 3 Life Expectancy Food Environment Index
## 4 4 Life Expectancy Food Environment Index
## 5 1 Life Expectancy Food Environment Index
## 5 2 Life Expectancy Food Environment Index
## 5 3 Life Expectancy Food Environment Index
## 5 4 Life Expectancy Food Environment Index
## Warning: Number of logged events: 42
georgia_data <- complete(mice_mod, action=m)
Now we will do the summery operation again to see if we have solved the NAs problem
summary(georgia_data)
## county_fips county_name region Population
## Min. :13001 Length:159 Length:159 Min. : 1600
## 1st Qu.:13082 Class :character Class :character 1st Qu.: 11090
## Median :13161 Mode :character Mode :character Median : 24064
## Mean :13161 Mean : 68634
## 3rd Qu.:13242 3rd Qu.: 57984
## Max. :13321 Max. :1074634
## Life Expectancy % 65 and Over % Black % Non-Hispanic White
## Min. :67.73 Min. : 5.443 Min. : 0.7529 Min. : 8.002
## 1st Qu.:72.11 1st Qu.:15.816 1st Qu.:14.9212 1st Qu.:50.784
## Median :73.49 Median :17.950 Median :27.5515 Median :61.166
## Mean :73.97 Mean :18.614 Mean :27.8147 Mean :60.800
## 3rd Qu.:75.67 3rd Qu.:20.346 3rd Qu.:39.1131 3rd Qu.:71.317
## Max. :81.30 Max. :36.864 Max. :71.1971 Max. :93.879
## Food Environment Index % Uninsured Average Daily PM2.5
## Min. :3.300 Min. : 9.35 Min. : 7.200
## 1st Qu.:6.900 1st Qu.:15.22 1st Qu.: 9.100
## Median :7.500 Median :16.80 Median : 9.400
## Mean :7.358 Mean :16.74 Mean : 9.336
## 3rd Qu.:8.000 3rd Qu.:18.28 3rd Qu.: 9.800
## Max. :9.300 Max. :26.28 Max. :10.800
## % Severe Housing Problems % Fair or Poor Health pct_people_poverty
## Min. : 7.774 Min. :10.60 Min. : 5.10
## 1st Qu.:11.589 1st Qu.:17.80 1st Qu.:13.20
## Median :13.966 Median :20.40 Median :17.80
## Mean :14.109 Mean :20.49 Mean :18.17
## 3rd Qu.:16.130 3rd Qu.:23.50 3rd Qu.:22.65
## Max. :23.874 Max. :30.70 Max. :36.30
## pct_children_poverty Unemployment_rate Median_Household_Income
## Min. : 5.30 Min. :2.100 Min. : 35952
## 1st Qu.:18.25 1st Qu.:2.900 1st Qu.: 46715
## Median :26.40 Median :3.300 Median : 52695
## Mean :25.71 Mean :3.489 Mean : 58390
## 3rd Qu.:32.00 3rd Qu.:3.800 3rd Qu.: 65052
## Max. :45.10 Max. :7.800 Max. :130909
## pct_low_edu pct_middle_low_edu pct_middle_high_edu pct_high_edu
## Min. : 2.869 Min. :14.51 Min. :17.64 Min. : 7.817
## 1st Qu.:10.584 1st Qu.:31.39 1st Qu.:26.91 1st Qu.:13.654
## Median :14.786 Median :35.68 Median :29.36 Median :18.149
## Mean :14.741 Mean :35.29 Mean :29.12 Mean :20.849
## 3rd Qu.:17.978 3rd Qu.:40.44 3rd Qu.:31.48 3rd Qu.:24.902
## Max. :28.386 Max. :49.21 Max. :43.36 Max. :57.995
Seem like we do! Now we will check for some duplicates
sum(duplicated(georgia_data))
## [1] 0
There aren’t!
numeric_vars <- georgia_data |>
select(where(is.numeric)) |>
names()
georgia_data |>
select(all_of(numeric_vars)) |>
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "value") |>
ggplot(aes(x = value)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
facet_wrap(~ variable, scales = "free") +
labs(title = "Histograms of Numeric Variables") +
theme_minimal() +
scale_x_continuous(labels = comma)
From these histograms, we can extract several important insights into the socioeconomic and demographic characteristics of the state of Georgia.
Population-related variables such as Population, Unemployment Rate, and % 65 and Over display a clear right-skew, indicating that most counties are sparsely populated with a smaller proportion of elderly residents. However, a few urban centers dominate in terms of population size.
Economic indicators reveal notable disparities across the state. The Median Household Income distribution is right-skewed, suggesting that while most counties have moderate income levels, a few wealthier counties significantly raise the upper range. Poverty-related metrics, such as pct_people_poverty and pct_children_poverty, also show right-skewed distributions, highlighting that economic hardship is widespread across many counties, with some areas experiencing extreme poverty levels.
Educational attainment reflects a similar divide, with variables like pct_high_education and pct_low-education indicating that most counties have low levels of higher educational achievement. However, a small number of counties stand out with significantly higher education rates, suggesting localized educational advantages.
Environmental factors, including Average Daily PM2.5 and the Food Environment Index, show a more normal distribution, indicating relatively consistent conditions across most counties, though some areas face greater environmental and food accessibility challenges.
Health-related variables, such as % Uninsured and Life Expectancy, are also right-skewed, implying disparities in healthcare access and outcomes across different counties. Some counties show much higher uninsured rates and shorter life expectancies than the state average.
Racial demographics reveal significant disparities. The % Non-Hispanic White population follows an approximately normal distribution, suggesting a relatively even spread across counties. In contrast, the % Black population distribution is right-skewed, indicating that while many counties have smaller Black populations, there are specific areas with significantly higher concentrations.
georgia_data |>
select(all_of(numeric_vars)) |>
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "value") |>
ggplot(aes(y = value)) +
geom_boxplot(fill = "steelblue") +
facet_wrap(~ variable, scales = "free") +
labs(title = "Boxplots of Numeric Variables") +
scale_y_continuous(labels = comma) +
theme_minimal()
These updated boxplots provide additional insights.
While outliers in Population, Unemployment Rate, and % 65 and Over still emphasize urban dominance, the Median Household Income boxplot now reveals a more pronounced spread, suggesting a wider income disparity than previously observed.
Environmental factors such as Average Daily PM2.5 and the Food Environment Index now show clearer outliers, indicating that some counties face significantly worse air quality and food accessibility challenges compared to others.
Health-related variables, including % Uninsured and Life Expectancy, display more distinct outliers, highlighting areas with particularly limited healthcare access and shorter life expectancies.
In terms of Education, the variability in pct_high_education and pct_low_education is more evident, suggesting sharper educational inequalities, with some counties significantly outperforming others.
Racial demographics remain consistent; however, the % Black distribution reveals even greater variability than before, suggesting a stronger concentration of Black populations in specific counties. Additionally, variables like % Severe Housing Problems now show clearer disparities, pointing to housing challenges in certain regions that weren’t as apparent in previous plots.
ggcorr(georgia_data, label = T)
## Warning in ggcorr(georgia_data, label = T): data in column(s) 'county_name',
## 'region' are not numeric and were ignored
From the correlation matrix, we can observe several strong relationships between socioeconomic and health-related variables. There is a strong positive correlation between pct_children_poverty and pct_poverty_all (0.9), suggesting that areas with higher child poverty rates also tend to have higher overall poverty levels. Similarly, pct_people_poverty shows a strong positive correlation with % Fair and Poor Health (0.9), indicating that regions with higher poverty levels tend to report poorer health outcomes. Additionally, % Fair and Poor Health is also strongly positively correlated with pct_children_poverty (0.9), reinforcing the connection between economic hardship and adverse health conditions.
On the other hand, median_house_income demonstrates strong negative correlations with several variables: pct_people_poverty (-0.9), pct_children_poverty (-0.9), and % Fair and Poor Health (-0.9). This indicates that areas with higher median household incomes tend to have lower poverty rates and better overall health outcomes. There is also a notable positive correlation between median_house_income and pct_high_education (0.8), suggesting that higher income levels are associated with greater educational attainment.
Education appears to play a significant role in shaping health outcomes as well. There is a strong negative correlation between % Fair and Poor Health and pct_high_education (-0.8), implying that higher education levels are linked to better health statuses within a population. These correlations highlight the interconnectedness of income, education, poverty, and health in shaping socioeconomic conditions.
b = ggplot(georgia_data, aes(x = `% Fair or Poor Health`,
y = pct_high_edu,
group = region,
size = Population,
color = Median_Household_Income,
text = county_name)) +
geom_point(alpha = 0.8) +
facet_wrap(~ region) +
scale_color_gradient(low = "red", high = "lightgreen") +
theme_minimal() +
labs(
title = "% Poor Health vs. % High Education",
subtitle = "(Color indicates median household income, size indicates population)",
x = "Percentage of Poor Health",
y = "Percentage of High Education"
)
ggplotly(b, tooltip=c("county_name"))
We can gain some insights into the characteristics of Georgia’s regions. As seen in the correlation matrix there appears to be a negative correlation between the percentage of individuals reporting poor health and the percentage of those with higher education. Additionally, median household income appears to be positively associated with educational attainment (they have a correlation of 0.8) and inversely related to poor health outcomes (with a correlation of -0.9).
The distinction between rural and urban counties is particularly visible. Rural counties are concentrated in the lower-right quadrant of the graph, indicating higher percentages of poor health and lower education levels, whereas urban counties are more dispersed, with a noticeable presence in areas of higher education and better health outcomes. Looking in more detail at the different types of counties, we can observe the following:
In rural counties, there is a clear concentration of data points showing high rates of poor health and low percentages of higher education attainment. Most rural counties also display lower median household incomes. This suggests that rural areas in Georgia are more economically disadvantaged, with limited access to educational and healthcare resources contributing to poorer health outcomes.
In contrast, urban counties tend to show higher education levels and lower rates of poor health. These counties also generally have higher household incomes. The urban data points are spread more widely across the graph, highlighting greater socioeconomic diversity within urban areas.
Since I have numerous interrelated variables and a moderately high number of them, I aim to reduce the dimensionality of the dataset by representing the multivariate information with a smaller set of variables, while retaining as much of the original information as possible -> so I will apply PCA
Y = georgia_data %>% dplyr::select(-county_fips, -county_name,-region, -Population)
# We just save the names and continents for the graphs
county_name = georgia_data$county_name
region = georgia_data$region
pca = prcomp(Y, scale = TRUE)
pca
## Standard deviations (1, .., p=17):
## [1] 2.749131e+00 1.649852e+00 1.134677e+00 1.073561e+00 1.013717e+00
## [6] 8.279902e-01 7.774250e-01 7.179994e-01 6.536368e-01 5.795667e-01
## [11] 5.515293e-01 4.262484e-01 2.899749e-01 2.275481e-01 1.963083e-01
## [16] 1.540250e-01 3.828840e-10
##
## Rotation (n x k) = (17 x 17):
## PC1 PC2 PC3 PC4
## Life Expectancy 0.26869980 -0.07840211 0.02848331 -0.28506891
## % 65 and Over -0.09369621 0.19598256 -0.35150511 -0.41429568
## % Black -0.18983863 -0.46865553 0.02159083 -0.08089284
## % Non-Hispanic White 0.15749793 0.49345283 -0.11300534 0.11160588
## Food Environment Index 0.19607856 0.09895055 0.17421991 -0.26657284
## % Uninsured -0.20851609 0.32354837 0.01338835 -0.19259345
## Average Daily PM2.5 0.06480388 -0.30622838 0.42845407 0.34377360
## % Severe Housing Problems -0.12274240 -0.28693450 -0.36247711 -0.21597076
## % Fair or Poor Health -0.34789861 -0.02967276 0.03367726 0.01799548
## pct_people_poverty -0.33247880 -0.08751688 0.03942420 -0.10100104
## pct_children_poverty -0.34177247 -0.07332402 -0.03389231 -0.11289412
## Unemployment_rate -0.21964399 -0.20632074 -0.05461702 -0.05205754
## Median_Household_Income 0.33774241 -0.05826897 0.16354781 -0.04582862
## pct_low_edu -0.28278184 0.17502026 0.28874163 -0.08968882
## pct_middle_low_edu -0.26462129 0.23425954 0.14790276 0.24543319
## pct_middle_high_edu 0.08323058 -0.08859812 -0.61471510 0.50526523
## pct_high_edu 0.29895044 -0.22020311 -0.01457694 -0.32165733
## PC5 PC6 PC7 PC8
## Life Expectancy -0.028347867 0.18611858 0.17672421 0.50396176
## % 65 and Over 0.540963034 0.10581820 -0.46150281 0.07773272
## % Black 0.173113714 -0.21514404 0.06736102 -0.01862919
## % Non-Hispanic White 0.050936928 0.14394528 -0.15825589 -0.19054150
## Food Environment Index 0.292850436 -0.78323922 0.07374477 0.02765785
## % Uninsured -0.341880202 -0.05103896 -0.16874173 0.37106108
## Average Daily PM2.5 0.147943440 0.04337263 -0.68293054 0.17924340
## % Severe Housing Problems -0.455782684 -0.17054823 -0.38001982 -0.30079070
## % Fair or Poor Health -0.085430957 -0.01779792 0.06476801 0.14066895
## pct_people_poverty -0.003839476 -0.00124731 0.14177528 0.01450418
## pct_children_poverty 0.120170684 -0.03030781 -0.02154441 0.05282698
## Unemployment_rate 0.380452588 0.38065099 0.20945934 0.04490845
## Median_Household_Income 0.023012603 0.06261067 -0.01536819 0.06071523
## pct_low_edu -0.159307562 -0.03857730 -0.08777239 0.33687756
## pct_middle_low_edu 0.194956828 -0.07719123 0.06058324 -0.30352525
## pct_middle_high_edu 0.068277605 -0.23535268 0.03357954 0.43764795
## pct_high_edu -0.081529552 0.16501187 -0.01046079 -0.12896053
## PC9 PC10 PC11 PC12
## Life Expectancy 0.30883856 0.4059772581 0.48278410 0.04321140
## % 65 and Over 0.21695213 -0.0023797195 -0.15930193 0.12585702
## % Black 0.16478049 0.1035496697 -0.26827398 0.08294668
## % Non-Hispanic White -0.11925984 -0.1428983700 0.37412895 -0.02466704
## Food Environment Index -0.28496858 0.0033680511 0.20509013 -0.07330209
## % Uninsured -0.24770069 0.2976193470 -0.33644518 -0.47839091
## Average Daily PM2.5 -0.05059518 0.0199762958 0.18125664 -0.19307788
## % Severe Housing Problems -0.16412446 0.2635002843 0.29861356 0.21593412
## % Fair or Poor Health 0.10192692 0.0005116272 0.03048176 0.04788494
## pct_people_poverty 0.15127083 -0.2860492486 0.37639600 -0.30302049
## pct_children_poverty 0.16067849 -0.2202474192 0.10555033 -0.24473153
## Unemployment_rate -0.71839994 0.1881891567 0.11845431 0.02722954
## Median_Household_Income -0.01082416 0.1275288412 -0.26891257 0.20488868
## pct_low_edu -0.13630731 -0.2808960512 0.04098699 0.61038744
## pct_middle_low_edu 0.21033552 0.5713858487 0.06959192 -0.01740626
## pct_middle_high_edu -0.05134946 -0.0618370337 0.01199742 -0.01733437
## pct_high_edu -0.05790430 -0.2332250640 -0.07457472 -0.29455302
## PC13 PC14 PC15 PC16
## Life Expectancy 0.09397618 0.03832409 0.04377321 -0.107528857
## % 65 and Over -0.01065253 0.03888722 -0.08192903 0.195244308
## % Black 0.20371840 0.06846357 -0.40256674 -0.575498872
## % Non-Hispanic White -0.10077526 0.08802469 -0.27947469 -0.592091998
## Food Environment Index -0.07887813 0.07455962 0.04455994 0.049999521
## % Uninsured 0.04559467 -0.06754179 -0.13547909 -0.109258313
## Average Daily PM2.5 0.01025984 0.05967851 -0.01067560 0.033286269
## % Severe Housing Problems -0.09341442 -0.10571899 0.02546875 0.011702511
## % Fair or Poor Health -0.61950272 0.66879126 -0.01336160 0.013223482
## pct_people_poverty -0.11787666 -0.33753165 -0.54898030 0.286900623
## pct_children_poverty -0.18154669 -0.33007455 0.63410694 -0.394900070
## Unemployment_rate -0.03412269 -0.01957905 0.01072176 -0.004096258
## Median_Household_Income -0.67902004 -0.47178459 -0.15366799 -0.085892812
## pct_low_edu 0.15253284 -0.11787539 -0.01775695 -0.048793246
## pct_middle_low_edu -0.02247271 -0.11653376 0.02178570 0.063016758
## pct_middle_high_edu -0.06498771 -0.08883005 -0.04274925 0.011308647
## pct_high_edu -0.03743016 0.17673784 0.01033986 -0.023580449
## PC17
## Life Expectancy -7.199676e-11
## % 65 and Over 4.747249e-11
## % Black -1.893458e-10
## % Non-Hispanic White -2.181302e-10
## Food Environment Index 1.476484e-11
## % Uninsured -1.026391e-11
## Average Daily PM2.5 -1.311142e-11
## % Severe Housing Problems 2.460665e-11
## % Fair or Poor Health -7.779529e-11
## pct_people_poverty 8.518379e-11
## pct_children_poverty -8.997977e-11
## Unemployment_rate 5.025109e-12
## Median_Household_Income -1.753227e-11
## pct_low_edu -3.719367e-01
## pct_middle_low_edu -5.087446e-01
## pct_middle_high_edu -2.799372e-01
## pct_high_edu -7.242079e-01
From this results we can see that the standard deviation for each principal component is the highest in the first PCs (PC1-PC2) meaning that they capture the most significant variances in the georgia_data dataset (after the first PC the amount of variance explain by each component decrease significantly at every following PC).
From the first principal component: we can find that the highest contribution (in absolute values) came from the variables: % Fair or Poor Health, pct_children_poverty, Median_Household_Income, pct_people_poverty, pct_high_edu, pct_low_edu, Life Expectancy and pct_middle_low_edu -> this dimension is dominated by poverty, education, health, and economic factors, it likely represents a “Socioeconomic & Health Disparity Dimension.”
From the second principal component: the highest contribution came from: Non-Hispanic White, % Black, % Uninsured, Average Daily PM25 and % Severe Housing Problems
fviz_screeplot(pca, addlabels = TRUE)
The barplot rapresents the interpretation that i was talking about before. But from this we can get an information that we couldn’t know before: the percentage of the contibution of every singular component to the whole model. The PC1 for example can, alone, explains 44.4% of the total variance of the original datasets if we add as well the variables that compone the PC2 (which explains around 16%) the cumulative variance will be 60.4%
Now we can look in more detail the relevant variables of the PC1. As said before PC1 probably captures socioeconomic and health-related disparities.
fviz_contrib(pca, choice = "var", axes = 1)
We can get also a list of 10 counties where the influence of the variables associated with the PC1 is the highest and the lowest.
county_name[order(pca$x[,1])][1:5]
## [1] "Hancock County" "Telfair County" "Macon County"
## [4] "Quitman County" "Taliaferro County"
county_name[order(pca$x[,1], decreasing=T)][1:5]
## [1] "Forsyth County" "Oconee County" "Fayette County" "Cherokee County"
## [5] "Columbia County"
We can filters the dataset to display all variables for Hancock County and Forsyth County and compare it with the summary of the georgia_data dataset. It become evident that Hancock County is one of the most disadvantaged (the majority of the values of its variables are closed to the min or to the first quantile). And that Forsyth County is one of the most advantaged (all values of the variables present in the PC1 are closed to the max or to the 3rd quantile).
summary(georgia_data)
## county_fips county_name region Population
## Min. :13001 Length:159 Length:159 Min. : 1600
## 1st Qu.:13082 Class :character Class :character 1st Qu.: 11090
## Median :13161 Mode :character Mode :character Median : 24064
## Mean :13161 Mean : 68634
## 3rd Qu.:13242 3rd Qu.: 57984
## Max. :13321 Max. :1074634
## Life Expectancy % 65 and Over % Black % Non-Hispanic White
## Min. :67.73 Min. : 5.443 Min. : 0.7529 Min. : 8.002
## 1st Qu.:72.11 1st Qu.:15.816 1st Qu.:14.9212 1st Qu.:50.784
## Median :73.49 Median :17.950 Median :27.5515 Median :61.166
## Mean :73.97 Mean :18.614 Mean :27.8147 Mean :60.800
## 3rd Qu.:75.67 3rd Qu.:20.346 3rd Qu.:39.1131 3rd Qu.:71.317
## Max. :81.30 Max. :36.864 Max. :71.1971 Max. :93.879
## Food Environment Index % Uninsured Average Daily PM2.5
## Min. :3.300 Min. : 9.35 Min. : 7.200
## 1st Qu.:6.900 1st Qu.:15.22 1st Qu.: 9.100
## Median :7.500 Median :16.80 Median : 9.400
## Mean :7.358 Mean :16.74 Mean : 9.336
## 3rd Qu.:8.000 3rd Qu.:18.28 3rd Qu.: 9.800
## Max. :9.300 Max. :26.28 Max. :10.800
## % Severe Housing Problems % Fair or Poor Health pct_people_poverty
## Min. : 7.774 Min. :10.60 Min. : 5.10
## 1st Qu.:11.589 1st Qu.:17.80 1st Qu.:13.20
## Median :13.966 Median :20.40 Median :17.80
## Mean :14.109 Mean :20.49 Mean :18.17
## 3rd Qu.:16.130 3rd Qu.:23.50 3rd Qu.:22.65
## Max. :23.874 Max. :30.70 Max. :36.30
## pct_children_poverty Unemployment_rate Median_Household_Income
## Min. : 5.30 Min. :2.100 Min. : 35952
## 1st Qu.:18.25 1st Qu.:2.900 1st Qu.: 46715
## Median :26.40 Median :3.300 Median : 52695
## Mean :25.71 Mean :3.489 Mean : 58390
## 3rd Qu.:32.00 3rd Qu.:3.800 3rd Qu.: 65052
## Max. :45.10 Max. :7.800 Max. :130909
## pct_low_edu pct_middle_low_edu pct_middle_high_edu pct_high_edu
## Min. : 2.869 Min. :14.51 Min. :17.64 Min. : 7.817
## 1st Qu.:10.584 1st Qu.:31.39 1st Qu.:26.91 1st Qu.:13.654
## Median :14.786 Median :35.68 Median :29.36 Median :18.149
## Mean :14.741 Mean :35.29 Mean :29.12 Mean :20.849
## 3rd Qu.:17.978 3rd Qu.:40.44 3rd Qu.:31.48 3rd Qu.:24.902
## Max. :28.386 Max. :49.21 Max. :43.36 Max. :57.995
georgia_data[county_name == "Hancock County", ]
## county_fips county_name region Population Life Expectancy % 65 and Over
## 70 13141 Hancock County Rural 8387 71.07075 25.38452
## % Black % Non-Hispanic White Food Environment Index % Uninsured
## 70 67.1277 27.49493 4.8 15.55379
## Average Daily PM2.5 % Severe Housing Problems % Fair or Poor Health
## 70 9.6 16.28664 28.6
## pct_people_poverty pct_children_poverty Unemployment_rate
## 70 30.3 44.1 5.5
## Median_Household_Income pct_low_edu pct_middle_low_edu pct_middle_high_edu
## 70 38570 26.09486 39.3696 26.17063
## pct_high_edu
## 70 8.364904
georgia_data[county_name == "Forsyth County", ]
## county_fips county_name region Population Life Expectancy % 65 and Over
## 58 13117 Forsyth County Urban 267237 81.30234 12.86648
## % Black % Non-Hispanic White Food Environment Index % Uninsured
## 58 4.509855 63.93651 8.7 9.717827
## Average Daily PM2.5 % Severe Housing Problems % Fair or Poor Health
## 58 10.4 9.331424 10.6
## pct_people_poverty pct_children_poverty Unemployment_rate
## 58 5.1 5.3 2.5
## Median_Household_Income pct_low_edu pct_middle_low_edu pct_middle_high_edu
## 58 130909 5.940409 14.51037 22.94617
## pct_high_edu
## 58 56.60305
We can see how the other variables that compose the second dimension are influencing the different counties, we can see that for example, % of white people in the county has a beneficial effect in this dimension and that % of black have a negative one. If we look at absolute values we can say that the other variables that influence the most the PC2 are % Uninsured (in a positive way, if the % is growing so are the effect on this dimension) and the Average Daily PM2.5,% and Severe Housing Problems (in a negative way). It’s interesting to notice how education influence this dimension, lower level have a positive effects and the % of people with a bachelor or a university degree have a negative effect. A positive effect came also from the % of people how are 65 or older.
barplot(pca$rotation[,2], las=2, col="steelblue")
Contribution of variables to second component: we can see that the most influential variables are % of white and % of black (contribution to the PC2 by the 25% and 22% ). Other relevant variables are % Uninsured, Average Daily PM2.5,% and Severe Housing Problems (by a lesser % then the previous 2 variables, but they still have a considerable influence in the contribution of the PC2). All this component, who defy the second dimension, are in some type way capturing a demographic, environmental and social divide between the counties of the State of Georgia.
fviz_contrib(pca, choice = "var", axes = 2)
Maybe we can get more insights by ranking the countries using this component:
county_name[order(pca$x[,2])][1:5]
## [1] "Dougherty County" "Fulton County" "Richmond County" "Clayton County"
## [5] "DeKalb County"
county_name[order(pca$x[,2], decreasing=T)][1:5]
## [1] "Brantley County" "Fannin County" "White County" "Rabun County"
## [5] "Gilmer County"
summary(georgia_data)
## county_fips county_name region Population
## Min. :13001 Length:159 Length:159 Min. : 1600
## 1st Qu.:13082 Class :character Class :character 1st Qu.: 11090
## Median :13161 Mode :character Mode :character Median : 24064
## Mean :13161 Mean : 68634
## 3rd Qu.:13242 3rd Qu.: 57984
## Max. :13321 Max. :1074634
## Life Expectancy % 65 and Over % Black % Non-Hispanic White
## Min. :67.73 Min. : 5.443 Min. : 0.7529 Min. : 8.002
## 1st Qu.:72.11 1st Qu.:15.816 1st Qu.:14.9212 1st Qu.:50.784
## Median :73.49 Median :17.950 Median :27.5515 Median :61.166
## Mean :73.97 Mean :18.614 Mean :27.8147 Mean :60.800
## 3rd Qu.:75.67 3rd Qu.:20.346 3rd Qu.:39.1131 3rd Qu.:71.317
## Max. :81.30 Max. :36.864 Max. :71.1971 Max. :93.879
## Food Environment Index % Uninsured Average Daily PM2.5
## Min. :3.300 Min. : 9.35 Min. : 7.200
## 1st Qu.:6.900 1st Qu.:15.22 1st Qu.: 9.100
## Median :7.500 Median :16.80 Median : 9.400
## Mean :7.358 Mean :16.74 Mean : 9.336
## 3rd Qu.:8.000 3rd Qu.:18.28 3rd Qu.: 9.800
## Max. :9.300 Max. :26.28 Max. :10.800
## % Severe Housing Problems % Fair or Poor Health pct_people_poverty
## Min. : 7.774 Min. :10.60 Min. : 5.10
## 1st Qu.:11.589 1st Qu.:17.80 1st Qu.:13.20
## Median :13.966 Median :20.40 Median :17.80
## Mean :14.109 Mean :20.49 Mean :18.17
## 3rd Qu.:16.130 3rd Qu.:23.50 3rd Qu.:22.65
## Max. :23.874 Max. :30.70 Max. :36.30
## pct_children_poverty Unemployment_rate Median_Household_Income
## Min. : 5.30 Min. :2.100 Min. : 35952
## 1st Qu.:18.25 1st Qu.:2.900 1st Qu.: 46715
## Median :26.40 Median :3.300 Median : 52695
## Mean :25.71 Mean :3.489 Mean : 58390
## 3rd Qu.:32.00 3rd Qu.:3.800 3rd Qu.: 65052
## Max. :45.10 Max. :7.800 Max. :130909
## pct_low_edu pct_middle_low_edu pct_middle_high_edu pct_high_edu
## Min. : 2.869 Min. :14.51 Min. :17.64 Min. : 7.817
## 1st Qu.:10.584 1st Qu.:31.39 1st Qu.:26.91 1st Qu.:13.654
## Median :14.786 Median :35.68 Median :29.36 Median :18.149
## Mean :14.741 Mean :35.29 Mean :29.12 Mean :20.849
## 3rd Qu.:17.978 3rd Qu.:40.44 3rd Qu.:31.48 3rd Qu.:24.902
## Max. :28.386 Max. :49.21 Max. :43.36 Max. :57.995
georgia_data[county_name == "Dougherty County", ]
## county_fips county_name region Population Life Expectancy % 65 and Over
## 47 13095 Dougherty County Urban 82966 71.41195 17.31794
## % Black % Non-Hispanic White Food Environment Index % Uninsured
## 47 70.94713 23.30834 5.2 16.48171
## Average Daily PM2.5 % Severe Housing Problems % Fair or Poor Health
## 47 10.8 21.95122 24.4
## pct_people_poverty pct_children_poverty Unemployment_rate
## 47 26.4 36.9 4.3
## Median_Household_Income pct_low_edu pct_middle_low_edu pct_middle_high_edu
## 47 42629 14.64419 30.84829 31.26925
## pct_high_edu
## 47 23.23827
georgia_data[county_name == "Brantley County", ]
## county_fips county_name region Population Life Expectancy % 65 and Over
## 13 13025 Brantley County Rural 18183 72.11387 17.70335
## % Black % Non-Hispanic White Food Environment Index % Uninsured
## 13 3.53077 90.68911 7.9 22.17292
## Average Daily PM2.5 % Severe Housing Problems % Fair or Poor Health
## 13 8.2 12.86031 21.1
## pct_people_poverty pct_children_poverty Unemployment_rate
## 13 17.8 25.8 3.5
## Median_Household_Income pct_low_edu pct_middle_low_edu pct_middle_high_edu
## 13 50370 18.76833 43.59198 26.7417
## pct_high_edu
## 13 10.89799
As said before the counties with the highest % of Black folks and generally higher level Average Daily PM2.5,% and Severe Housing Problems are on the bottom ranking of this component (like Dougherty County) and the ones that have a high % of white folks and % Uninsured and a low attributes of Average Daily PM2.5,% and Severe Housing Problems are on the top part of this ranking.
Now we can compare the first two Pcs and see if they are independent between them.
data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>%
ggplot(aes(z1,z2,label=county_name,color=region)) + geom_point(size=0) +
labs(title="First two principal components (scores)", x="PC1", y="PC2") + #guides(color=guide_legend(title="HDI"))+
theme_bw() +theme(legend.position="bottom") + geom_text(size=3, hjust=0.6, vjust=0, check_overlap = TRUE)
As said before the two component represents different dimensions:
And from this plot we can see that the urban counties (e.g., Fulton, Forsyth, Cobb, Columbia) are positioned on the right side of the graph (high PC1 scores), meaning they have better socioeconomic conditions (higher income, better education, and lower poverty). On the other hand, rural counties (e.g., Dougherty, Hancock, Wheeler, Telfair) appear on the left (low PC1 scores), indicating worse socioeconomic conditions (higher poverty, worse health outcomes, and lower income). We can say that PC1 and Pc2 are independent because:
a high PC1 score doesn’t predict PC2 scores
each PC captures unique information without overlapping with others.
With clustering, we aim to go beyond the simple rural vs. urban dichotomy and identify groups of counties with similar characteristics. This approach also helps reduce complexity, as it allows us to analyze a few representative groups instead of examining each county individually, making the analysis more structured and insightful.
Let’s start with 4 different cluster and let’s see their characteristics
fit = kmeans(scale(Y), centers=3, nstart = 1000)
centers=fit$centers
k = 3
par(mfrow = c(2, 3))
for (i in 1:k) {
barplot(centers[i, ],
main = paste("Cluster", i, "Center"),
las = 2,
col = "steelblue",
ylim = c(min(centers), max(centers)))
}
par(mfrow = c(1, 1))
We can not see in detail, it’s better to see every graph one by one:
centers=fit$centers
par(mar=c(10,4,4,2))
barplot(centers[1,], las=2, col="steelblue", cex.names=0.8)
barplot(centers[2,], las=2, col="orange", cex.names=0.8)
barplot(centers[3,], las=2, col="purple", cex.names=0.8)
The first cluster has a lower percentage of people aged 65 and older than the average, a higher percentage of Black residents, and fewer White residents. There is also a higher percentage of people affected by severe housing problems. The population is generally well-educated, with a higher percentage of individuals who have attended college or university compared to those with lower levels of education. Additionally, key indicators such as life expectancy, percentage of people in fair or poor health, overall poverty rate, and child poverty rate show close to zero variance. There is a moderate level of unemployment, and the median household income is slightly lower than the average. This cluster generally represents diverse, socioeconomically challenged communities with a stable yet unequal health and economic landscape.
The second cluster has a higher percentage of White people and a lower percentage of Black people. Several variables have values slightly above the average, including life expectancy, the percentage of people aged 65 and older, the food environment index, the percentage of uninsured individuals, and median household income. Conversely, some variables have values below the average, such as average daily PM2.5 (air pollution), the percentage of severe housing problems, the percentage of people in fair or poor health, the overall poverty rate, the child poverty rate, and the unemployment rate. This cluster represents predominantly White, older, suburban, or rural communities with stable economic and health conditions.
The third cluster has higher values for life expectancy, the food environment index, and average daily PM2.5 (air pollution). The highest values are found in median household income and the percentage of individuals with higher education (pct_high_edu). Moderately lower values are observed for the percentage of people aged 65 and older, the percentage of Black residents, the percentage of severe housing problems, and the unemployment rate. Significantly lower values are recorded for the percentage of people in fair or poor health, the overall poverty rate, child poverty, and the percentages of individuals with low and middle-low education levels. Variables that remain close to zero include the percentage of White residents and the percentage of individuals with middle-high education (pct_middle_high_edu). This cluster represents affluent, highly educated communities with strong economic and health conditions. It is characterized by high median household income and a well-educated population, with a significant share of residents holding higher education degrees.
We can also rapresent the 3 different cluster in a plot
fviz_cluster(fit, data = Y, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
theme_minimal()+geom_text(label=county_name,hjust=0, vjust=0,size=2,check_overlap = F)+scale_fill_brewer(palette="Paired")
Comparing these results with the division between rural and urban counties, we can see that the first and second clusters appear to have a fairly equal distribution of rural and urban counties. However, the first cluster contains the fewest counties among all four clusters. The third cluster is the only one where urban counties outnumber rural counties, although a considerable number of rural counties are still present in this group.
We can see the number of cluster that the machine will reccomand us to use for the clustering
fviz_nbclust(scale(Y), kmeans, method = 'wss', k.max = 20, nstart = 1000)
fviz_nbclust(scale(Y), kmeans, method = 'silhouette', k.max = 20, nstart = 1000)
The ‘silhouette’ method indicates that the optimal number of clusters is two, which is valid; however, in my opinion, this would oversimplify the analysis, essentially reducing it to a basic division between rural and urban areas. This approach would result in the loss of valuable insights gained from the clustering process and limit the depth of the analysis.
It is important to decide on the distance metric and linkage method when performing hierarchical clustering.
At first, we compute the distance matrix using the Euclidean distance after scaling the data. Then, we apply hierarchical clustering using the “complete” linkage method, which considers the maximum distance between clusters when merging them. I have considerate using other linkage methods such as “average”, “single”, “mcquitty” but in the following part, the one about the visualization by dendrogram, the results (apart from method = ‘complete’) where strange with a big cluster with the 90% of the counties that where forming the cluster and the other three cluster with only 2-3 counties.
#basic hierarchical
d = dist(scale(Y), method = "euclidean")
hc <- hclust(d, method = "complete")
Classical dendrogram:
hc$labels <- county_name
fviz_dend(x = hc,
k=3,
palette = "jco",
rect = TRUE, rect_fill = TRUE, cex=0.5,
rect_border = "jco"
)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
From this dendrogram, we can observe the relationship within the
clustering of these counties. Looking at the plot, we can see that the
grey and yellow areas are related to each other. However, their branches
merge at a relatively high height, indicating that while they share some
similarities, their level of similarity is lower than expected. This
suggests that these clusters are distinct but still connected, though
their differences are significant enough to warrant separation at a
higher threshold.
Let’s use a phylogenic tree to see in more detail the clustering division between the counties and the distance that they have between them:
fviz_dend(x = hc,
k = 3,
color_labels_by_k = TRUE,
cex = 0.8,
type = "phylogenic",
repel = TRUE)+ labs(title="Socio-economic-demographic tree clustering of the counties of the state of Georgia") + theme(axis.text.x=element_blank(),axis.text.y=element_blank())
df_pca <- data.frame(
county_name = georgia_data$county_name,
PC1 = pca$x[,1], # First Principal Component
PC2 = pca$x[,2], # Second Principal Component
region = georgia_data$region,
stringsAsFactors = FALSE
)
# i want to create a Weighted Sum of PC1 and PC2 (based on their contribution on the model)
df_pca$PC_combined <- (0.7357 * df_pca$PC1) + (0.2643 * df_pca$PC2)
df_merge <- dplyr::left_join(df_pca, georgia_election, by = "county_name")
corr_value <- cor(df_merge$PC_combined, df_merge$pct_gop, use = "complete.obs")
plot_gg <- ggplot(df_merge, aes(x = PC_combined, y = pct_gop, color = region, text = county_name)) +
geom_point(alpha = 0.7) +
theme_minimal() +
labs(
title = paste0("PPC1 and PC2 vs % of votes for the GOP presidential candidate\nCorrelation = ",
round(corr_value, 3)),
x = "PC1 and PC2",
y = "% votes for GOP"
) +
theme(legend.position = "bottom")
ggplotly(plot_gg, tooltip = c("x", "y", "text"))
The correlation of 0.305 indicates a moderate positive relationship between the PC1+PC2 index and GOP vote percentage. Counties with higher PC1+PC2 values (indicating better socioeconomic conditions, a predominantly White population, and urban characteristics) tend to vote more Republican—but the trend is not very strong.
This, in some ways, challenges the common notion that rural areas with certain socioeconomic characteristics overwhelmingly vote for the Republican Party. It is also interesting to note the significant variation in voting patterns even among counties with similar PC1+PC2 scores. Some counties with low PC1+PC2 scores still exhibit high GOP vote percentages, suggesting that factors beyond socioeconomic and demographic indicators play a crucial role in shaping voting behavior in the state of Georgia.
One potentially relevant factor missing from this model is the level of trust and loyalty toward the Republican Party. In today’s political landscape, partisan alignment is increasingly shaped by personal identity and loyalty rather than traditional ideological differences. The role of central figures like Donald Trump has transcended conventional political ideologies, making party loyalty and personal trust in leaders a key driver of voting behavior, beyond socioeconomic and demographic factors.
options(tigris_use_cache = TRUE)
georgia_counties <- counties(state = "GA", cb = TRUE, class = "sf")
## Retrieving data for the year 2022
georgia_counties$GEOID <- as.integer(georgia_counties$GEOID)
groups.hc <- cutree(hc, k = 3)
georgia_map_data <- data.frame(
county_fips = georgia_data$county_fips, # County FIPS from your data
cluster = groups.hc # Clustering results
)
georgia_counties <- georgia_counties %>%
left_join(georgia_map_data, by = c("GEOID" = "county_fips"))
ggplot(data = georgia_counties) +
geom_sf(aes(fill = as.factor(cluster)), color = "black") + # Color by cluster
scale_fill_manual(values = rainbow(length(unique(groups.hc)))) + # Distinct colors per cluster
theme_minimal() +
labs(
title = "Georgia Counties Cluster Map",
fill = "Cluster"
)
The clustering has provide a clear distinction between rural, suburban,
and urban counties across Georgia.
The red counties (Cluster 1) are primarily located in central, southern, and eastern Georgia, representing largely rural areas that face significant economic and health disadvantages. These counties tend to have lower median household incomes, higher poverty rates, and lower education levels, contributing to higher unemployment and poorer health outcomes. Limited access to healthcare and essential services, combined with fewer economic opportunities, makes these areas particularly vulnerable. These counties are characteristic of regions where rural poverty and social inequalities are most pronounced.
The green counties (Cluster 2) occupy a middle ground, encompassing a mix of suburban and semi-rural areas spread throughout the state. While they do not experience the severe economic hardships of Cluster 1, they also do not enjoy the same level of prosperity as the most developed counties. These counties have moderate income levels, relatively lower unemployment, and better access to healthcare compared to the more disadvantaged areas. They reflect communities where economic stability is present, but challenges such as income disparities, education gaps, and healthcare accessibility still exist.
The blue counties (Cluster 3), on the other hand, are primarily found in northwestern Georgia and metropolitan areas, including the Atlanta metro region. These counties represent urban and developed areas where economic and social conditions are significantly stronger. They are characterized by higher median household incomes, lower poverty rates, and better education levels, resulting in higher life expectancy and improved overall well-being. Residents in these areas benefit from better access to healthcare, employment opportunities, and infrastructure, which contribute to their stronger economic standing.
merged_data <- left_join(georgia_election, georgia_map_data , by = "county_fips")
# Check the first few rows
head(merged_data)
## county_fips county_name total_votes pct_gop pct_dem pct_other cluster
## 1 13001 Appling County 8334 81.12551 18.71850 0.1559875 1
## 2 13003 Atkinson County 3057 76.87275 22.89827 0.2289827 1
## 3 13005 Bacon County 4839 86.50548 13.32920 0.1653234 1
## 4 13007 Baker County 1476 59.82385 39.97290 0.2032520 2
## 5 13009 Baldwin County 18825 50.85790 48.65339 0.4887118 1
## 6 13011 Banks County 10519 88.96283 10.79951 0.2376652 1
# Summarize average voting percentages per cluster
cluster_summary <- merged_data %>%
group_by(cluster) %>%
summarise(
avg_republicans = mean(`pct_gop`, na.rm = TRUE),
avg_democrats = mean(`pct_dem`, na.rm = TRUE)
)
# Print the summary
print(cluster_summary)
## # A tibble: 3 × 3
## cluster avg_republicans avg_democrats
## <int> <dbl> <dbl>
## 1 1 69.1 30.6
## 2 2 56.1 43.6
## 3 3 62.0 37.3
# Visualize the voting pattern per cluster
ggplot(merged_data, aes(x = as.factor(cluster), y = `pct_gop`, fill = as.factor(cluster))) +
geom_boxplot() +
labs(title = "Republican Vote Percentage by Cluster", x = "Cluster", y = "Republican Vote %") +
theme_minimal()
ggplot(merged_data, aes(x = as.factor(cluster), y = `pct_dem`, fill = as.factor(cluster))) +
geom_boxplot() +
labs(title = "Democratic Vote Percentage by Cluster", x = "Cluster", y = "Democratic Vote %") +
theme_minimal()
The common assumption that rural residents with lower levels of education overwhelmingly vote for the Republican Party, while urban, highly educated individuals consistently support the Democratic Party, is an oversimplification of a far more complex reality. By analyzing voting patterns in Georgia through PCA (Principal Component Analysis) and hierarchical clustering, we have demonstrated that political preferences cannot be reduced to a simple urban-rural divide.
The PCA results have revealed that a citizen’s vote is influenced by a wide range of socioeconomic and environmental conditions. Factors such as income, education level, race, poverty status, access to healthy food, personal health, and environmental conditions all contribute to shaping political preferences. However, none of these factors alone are sufficient to accurately predict how an individual will vote. This challenges the idea that demographic categories can serve as definitive predictors of political alignment.
Furthermore, the clustering analysis has shown that even in urbanized counties, where Democratic support is often assumed to be dominant, voting patterns are more balanced and diverse than expected. Not all city dwellers vote Democratic, just as not all rural residents vote Republican. Political behavior is nuanced, influenced by local economic conditions, historical context, cultural identity, and personal experiences rather than solely by geography or educational attainment.